Results 1 - 20 of 32
1.
Front Mol Biosci ; 11: 1352508, 2024.
Article in English | MEDLINE | ID: mdl-38606289

ABSTRACT

Antibodies are proteins produced by our immune system that have been harnessed as biotherapeutics. The discovery of antibody-based therapeutics relies on analyzing large volumes of diverse sequences coming from phage display or animal immunizations. Identification of suitable therapeutic candidates is achieved by grouping the sequences by their similarity and subsequent selection of a diverse set of antibodies for further tests. Such groupings are typically created using sequence-similarity measures alone. Maximizing diversity in selected candidates is crucial to reducing the number of tests of molecules with near-identical properties. With the advances in structural modeling and machine learning, antibodies can now be grouped across other diversity dimensions, such as predicted paratopes or three-dimensional structures. Here we benchmarked antibody grouping methods using clonotype, sequence, paratope prediction, structure prediction, and embedding information. The methods were evaluated on two tasks: binder detection and epitope mapping. We demonstrate that on binder detection no method appears to outperform the others, while on epitope mapping, clonotype, paratope, and embedding clusterings are top performers. Most importantly, all the methods propose orthogonal groupings, offering more diverse pools of candidates when using multiple methods than any single method alone. To facilitate exploring the diversity of antibodies using different methods, we have created an online tool, CLAP, available at clap.naturalantibody.com, that allows users to group, contrast, and visualize antibodies using the different grouping methods.
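
Illustrative sketch (not taken from the paper): the abstract does not define how clonotypes are computed, so the example below assumes one common convention, namely that two antibodies share a clonotype when they have the same V gene, the same J gene, the same CDR-H3 length, and a CDR-H3 identity at or above a threshold. The field names (`v_gene`, `j_gene`, `cdrh3`), the 0.8 threshold, and the greedy assignment to the first matching group are all assumptions made for illustration.

```python
from collections import defaultdict


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length strings."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def clonotype_groups(antibodies, min_identity=0.8):
    """Greedy clonotype grouping (assumed definition, see lead-in).

    Antibodies are first bucketed by (V gene, J gene, CDR-H3 length).
    Within a bucket, each antibody joins the first group whose founding
    member has CDR-H3 identity >= min_identity, else it founds a new group.
    """
    buckets = defaultdict(list)
    for ab in antibodies:
        buckets[(ab["v_gene"], ab["j_gene"], len(ab["cdrh3"]))].append(ab)

    groups = []
    for members in buckets.values():
        bucket_groups = []
        for ab in members:
            for group in bucket_groups:
                if identity(ab["cdrh3"], group[0]["cdrh3"]) >= min_identity:
                    group.append(ab)
                    break
            else:
                bucket_groups.append([ab])
        groups.extend(bucket_groups)
    return groups


# Hypothetical input records, invented for this example.
EXAMPLE = [
    {"name": "ab1", "v_gene": "IGHV3-23", "j_gene": "IGHJ4", "cdrh3": "ARDYYGSSYFDY"},
    {"name": "ab2", "v_gene": "IGHV3-23", "j_gene": "IGHJ4", "cdrh3": "ARDYYGSTYFDY"},
    {"name": "ab3", "v_gene": "IGHV3-23", "j_gene": "IGHJ4", "cdrh3": "ARGGWLRPFFDY"},
    {"name": "ab4", "v_gene": "IGHV1-69", "j_gene": "IGHJ6", "cdrh3": "ARDYYGSSYFDY"},
]

if __name__ == "__main__":
    for group in clonotype_groups(EXAMPLE):
        print([ab["name"] for ab in group])
```

Selecting one representative per group is then one simple way to assemble a diverse candidate pool; the other grouping dimensions named in the abstract (paratope, structure, embedding) would replace the CDR-H3 identity criterion with their own similarity measure.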

2.
Bioinform Adv ; 4(1): vbae033, 2024.
Article in English | MEDLINE | ID: mdl-38560554

ABSTRACT

Motivation: Nanobodies are a subclass of immunoglobulins, whose binding site consists of only one peptide chain, bestowing favorable biophysical properties. Recently, the first nanobody therapy was approved, paving the way for further clinical applications of this antibody format. Further development of nanobody-based therapeutics could be streamlined by computational methods. One such method is infilling: positional prediction of biologically feasible mutations in nanobodies. Being able to identify possible positional substitutions based on sequence context facilitates functional design of such molecules. Results: Here we present nanoBERT, a nanobody-specific transformer to predict amino acids in a given position in a query sequence. We demonstrate the need to develop such a machine-learning-based protocol as opposed to gene-specific positional statistics, since an appropriate genetic reference is not available. We benchmark nanoBERT with respect to human-based language models and ESM-2, demonstrating the benefit of domain-specific language models. We also demonstrate the benefit of employing nanobody-specific predictions for fine-tuning on an experimentally measured thermostability dataset. We hope that nanoBERT will help engineers in a range of predictive tasks for designing therapeutic nanobodies. Availability and implementation: https://huggingface.co/NaturalAntibody/.

3.
PLoS Comput Biol ; 20(3): e1011881, 2024 Mar.
Article in English | MEDLINE | ID: mdl-38442111

ABSTRACT

Antibody-based therapeutics must not undergo chemical modifications that would impair their efficacy or hinder their developability. A commonly used technique to de-risk lead biotherapeutic candidates annotates chemical liability motifs on their sequence. By analyzing sequences from all major sources of data (therapeutics, patents, GenBank, literature, and next-generation sequencing outputs), we find that almost all antibodies contain an average of 3-4 such liability motifs in their paratopes, irrespective of the source dataset. This is in line with the common wisdom that liability motif annotation is over-predictive. Therefore, we have compiled three computational flags to prioritize liability motifs for removal from lead drug candidates: 1. germline, to reflect naturally occurring motifs, 2. therapeutic, reflecting chemical liability motifs found in therapeutic antibodies, and 3. surface, indicative of structural accessibility for chemical modification. We show that these flags annotate approximately 60% of liability motifs as benign, that is, the flagged liabilities have a smaller probability of undergoing degradation as benchmarked on two experimental datasets covering deamidation, isomerization, and oxidation. We combined the liability detection and flags into a tool called Liability Antibody Profiler (LAP), publicly available at lap.naturalantibody.com. We anticipate that LAP will save time and effort in de-risking therapeutic molecules.
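
Illustrative sketch (not LAP's implementation): sequence-based liability annotation of the kind described above can be approximated with regular-expression scanning. The abstract names the degradation types (deamidation, isomerization, oxidation) but not the motif definitions, so the patterns below are assumptions chosen for illustration, as is the `germline_positions` argument standing in for the "germline" flag.

```python
import re

# Hypothetical motif definitions; the abstract does not list LAP's motifs.
LIABILITY_MOTIFS = {
    "deamidation": re.compile(r"N[GS]"),
    "isomerization": re.compile(r"D[GS]"),
    "oxidation": re.compile(r"[MW]"),
}


def annotate_liabilities(sequence, germline_positions=frozenset()):
    """Return liability hits sorted by 0-based start position.

    germline_positions is an assumed input: start positions at which the
    same motif also occurs in the germline reference, so that the hit can
    be flagged as naturally occurring (lower priority for removal).
    """
    hits = []
    for name, pattern in LIABILITY_MOTIFS.items():
        for match in pattern.finditer(sequence):
            hits.append(
                {
                    "liability": name,
                    "start": match.start(),
                    "motif": match.group(),
                    "germline_flag": match.start() in germline_positions,
                }
            )
    return sorted(hits, key=lambda hit: hit["start"])


if __name__ == "__main__":
    # Invented CDR-like fragment, for demonstration only.
    for hit in annotate_liabilities("ARDGYNSMDV", germline_positions={7}):
        print(hit)
```

A "therapeutic" or "surface" flag could be attached the same way, given a lookup of motifs seen in approved antibodies or a per-residue solvent-accessibility estimate; neither is modeled here.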


Subject(s)
Antibodies , High-Throughput Nucleotide Sequencing , Antibodies/therapeutic use , Probability
4.
Front Mol Biosci ; 10: 1214424, 2023.
Article in English | MEDLINE | ID: mdl-37484529

ABSTRACT

AlphaFold2 has hallmarked a generational improvement in protein structure prediction. In particular, advances in antibody structure prediction have provided a highly translatable impact on drug discovery. Though AlphaFold2 laid the groundwork for all proteins, antibody-specific applications require adjustments tailored to these molecules, which has resulted in a handful of deep learning antibody structure predictors. Herein, we review the recent advances in antibody structure prediction and relate them to their role in advancing biologics discovery.

5.
JMIR Infodemiology ; 2(2): e35121, 2022.
Article in English | MEDLINE | ID: mdl-36348981

ABSTRACT

Background: Achieving herd immunity through vaccination depends upon the public's acceptance, which in turn relies on their understanding of its risks and benefits. The fundamental objective of public health messaging on vaccines is therefore the clear communication of often complex information and, increasingly, the countering of misinformation. The primary outlet shaping public understanding is mainstream online news media, where coverage of COVID-19 vaccines was widespread. Objective: We used text-mining analysis on the front pages of mainstream online news to quantify the volume and sentiment polarization of vaccine coverage. Methods: We analyzed 28 million articles from 172 major news sources across 11 countries between July 2015 and April 2021. We employed keyword-based frequency analysis to estimate the proportion of overall articles devoted to vaccines. We performed topic detection using BERTopic and named entity recognition to identify the leading subjects and actors mentioned in the context of vaccines. We used the Vader Python module to perform sentiment polarization quantification of all collated English-language articles. Results: The proportion of front-page articles mentioning vaccines increased from 0.1% to 4% with the outbreak of COVID-19. The number of negatively polarized articles increased from 6698 in 2015-2019 to 28,552 in 2020-2021. However, overall vaccine coverage before the COVID-19 pandemic was slightly negatively polarized (57% negative), whereas coverage during the pandemic was positively polarized (38% negative). Conclusions: Throughout the pandemic, vaccines have risen from a marginal to a widely discussed topic on the front pages of major news outlets. Mainstream online media has been positively polarized toward vaccines, compared with mainly negative prepandemic vaccine news. 
However, the pandemic was accompanied by an order-of-magnitude increase in vaccine news that, due to low prepandemic frequency, may contribute to a perceived negative sentiment. These results highlight important interactions between the volume of news and overall polarization. To the best of our knowledge, our work is the first systematic text mining study of front-page vaccine news headlines in the context of COVID-19.

6.
Brief Bioinform ; 23(4), 2022 07 18.
Article in English | MEDLINE | ID: mdl-35830864

ABSTRACT

Antibodies are versatile molecular binders with an established and growing role as therapeutics. Computational approaches to developing and designing these molecules are being increasingly used to complement traditional lab-based processes. Nowadays, in silico methods fill multiple elements of the discovery stage, such as characterizing antibody-antigen interactions and identifying developability liabilities. Recently, computational methods tackling such problems have begun to follow machine learning paradigms, in many cases deep learning specifically. This paradigm shift offers improvements in established areas such as structure or binding prediction and opens up new possibilities such as language-based modeling of antibody repertoires or machine-learning-based generation of novel sequences. In this review, we critically examine the recent developments in (deep) machine learning approaches to therapeutic antibody design with implications for fully computational antibody design.


Subject(s)
Deep Learning , Antibodies/therapeutic use , Feasibility Studies , Machine Learning
7.
Bioinformatics ; 38(9): 2628-2630, 2022 04 28.
Article in English | MEDLINE | ID: mdl-35274671

ABSTRACT

MOTIVATION: Rational design of therapeutic antibodies can be improved by harnessing the natural sequence diversity of these molecules. Our understanding of the diversity of antibodies has recently been greatly facilitated through the deposition of hundreds of millions of human antibody sequences in next-generation sequencing (NGS) repositories. Contrasting a query therapeutic antibody sequence to naturally observed diversity in similar antibody sequences from NGS can provide a mutational roadmap for antibody engineers designing biotherapeutics. Because of the sheer scale of the antibody NGS datasets, performing queries across them is computationally challenging. RESULTS: To facilitate harnessing antibody NGS data, we developed AbDiver (http://naturalantibody.com/abdiver), a free portal allowing users to compare their query sequences to those observed in the natural repertoires. AbDiver offers three antibody-specific use-cases: (i) compare a query antibody to positional variability statistics precomputed from multiple independent studies, (ii) retrieve close full variable sequence matches to a query antibody and (iii) retrieve CDR3 or clonotype matches to a query antibody. We applied our system to a set of 742 therapeutic antibodies, demonstrating that for each use-case our system can retrieve relevant results for most sequences. AbDiver facilitates the navigation of vast antibody mutation space for the purpose of rational therapeutic antibody design. AVAILABILITY AND IMPLEMENTATION: AbDiver is freely accessible at http://naturalantibody.com/abdiver. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Antibodies , High-Throughput Nucleotide Sequencing , Humans , Antibodies/therapeutic use , Antibodies/genetics , Software
8.
MAbs ; 14(1): 2020082, 2022.
Article in English | MEDLINE | ID: mdl-35104168

ABSTRACT

Therapeutic monoclonal antibodies and their derivatives are key components of clinical pipelines in the global biopharmaceutical industry. The availability of large datasets of antibody sequences, structures, and biophysical properties is increasingly enabling the development of predictive models and computational tools for the "developability assessment" of antibody drug candidates. Here, we provide an overview of the antibody informatics tools applicable to the prediction of developability issues such as stability, aggregation, immunogenicity, and chemical degradation. We further evaluate the opportunities and challenges of using biopharmaceutical informatics for drug discovery and optimization. Finally, we discuss the potential of developability guidelines based on in silico metrics that can be used for the assessment of antibody stability and manufacturability.


Subject(s)
Antibodies, Monoclonal , Biological Products , Computer Simulation , Drug Discovery , Humans
9.
Gigascience ; 12, 2022 Dec 28.
Article in English | MEDLINE | ID: mdl-37983748

ABSTRACT

BACKGROUND: Machine learning (ML) technologies, especially deep learning (DL), have gained increasing attention in predictive mass spectrometry (MS) for enhancing the data-processing pipeline from raw data analysis to end-user predictions and rescoring. ML models need large-scale datasets for training and repurposing, which can be obtained from a range of public data repositories. However, applying ML to public MS datasets on larger scales is challenging, as they vary widely in terms of data acquisition methods, biological systems, and experimental designs. RESULTS: We aim to facilitate ML efforts in MS data by conducting a systematic analysis of the potential sources of variability in public MS repositories. We also examine how these factors affect ML performance and perform a comprehensive transfer learning to evaluate the benefits of current best practice methods in the field for transfer learning. CONCLUSIONS: Our findings show significantly higher levels of homogeneity within a project than between projects, which indicates that it is important to construct datasets most closely resembling future test cases, as transferability is severely limited for unseen datasets. We also found that transfer learning, although it did increase model performance, did not increase model performance compared to a non-pretrained model.


Subject(s)
Machine Learning , Tandem Mass Spectrometry , Chromatography, Liquid
10.
Bioinformatics ; 38(3): 875-877, 2022 01 12.
Article in English | MEDLINE | ID: mdl-34636883

ABSTRACT

MOTIVATION: Liquid-chromatography mass-spectrometry (LC-MS) is the established standard for analyzing the proteome in biological samples by identification and quantification of thousands of proteins. Machine learning (ML) promises to considerably improve the analysis of the resulting data; however, there is not yet any tool that mediates the path from raw data to modern ML applications. More specifically, ML applications are currently hampered by three major limitations: (i) absence of balanced training data with large sample size; (ii) unclear definition of sufficiently information-rich data representations for, e.g., peptide identification; (iii) lack of benchmarking of ML methods on specific LC-MS problems. RESULTS: We created the MS2AI pipeline that automates the process of gathering vast quantities of MS data for large-scale ML applications. The software retrieves raw data from either in-house sources or from the proteomics identifications database, PRIDE. Subsequently, the raw data are stored in a standardized format amenable to ML, encompassing MS1/MS2 spectra and peptide identifications. This tool bridges the gap between MS and AI, and to this effect we also present an ML application in the form of a convolutional neural network for the identification of oxidized peptides. AVAILABILITY AND IMPLEMENTATION: An open-source implementation of the software can be found at https://gitlab.com/roettgerlab/ms2ai. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Peptides , Tandem Mass Spectrometry , Chromatography, Liquid/methods , Tandem Mass Spectrometry/methods , Peptides/analysis , Software , Proteome/chemistry
11.
Nucleic Acids Res ; 50(D1): D1273-D1281, 2022 01 07.
Article in English | MEDLINE | ID: mdl-34747487

ABSTRACT

Nanobodies, a subclass of antibodies found in camelids, are versatile molecular binding scaffolds composed of a single polypeptide chain. The small size of nanobodies bestows multiple therapeutic advantages (stability, tumor penetration) with the first therapeutic approval in 2018 cementing the clinical viability of this format. Structured data and sequence information of nanobodies will enable the accelerated clinical development of nanobody-based therapeutics. Though the nanobody sequence and structure data are deposited in the public domain at an accelerating pace, the heterogeneity of sources and lack of standardization hampers reliable harvesting of nanobody information. We address this issue by creating the Integrated Database of Nanobodies for Immunoinformatics (INDI, http://naturalantibody.com/nanobodies). INDI collates nanobodies from all the major public outlets of biological sequences: patents, GenBank, next-generation sequencing repositories, structures and scientific publications. We equip INDI with powerful nanobody-specific sequence and text search facilitating access to >11 million nanobody sequences. INDI should facilitate development of novel nanobody-specific computational protocols helping to deliver on the therapeutic promise of this drug format.


Subject(s)
Camelidae/immunology , Databases, Genetic , Neoplasms/therapy , Single-Domain Antibodies/immunology , Amino Acid Sequence/genetics , Animals , Antibodies/classification , Antibodies/immunology , Camelidae/classification , Humans , Immunotherapy/classification , Neoplasms/immunology , Single-Domain Antibodies/classification
13.
J Med Internet Res ; 23(6): e28253, 2021 06 02.
Article in English | MEDLINE | ID: mdl-33900934

ABSTRACT

BACKGROUND: Before the advent of an effective vaccine, nonpharmaceutical interventions, such as mask-wearing, social distancing, and lockdowns, have been the primary measures to combat the COVID-19 pandemic. Such measures are highly effective when there is high population-wide adherence, which requires information on current risks posed by the pandemic alongside a clear exposition of the rules and guidelines in place. OBJECTIVE: Here we analyzed online news media coverage of COVID-19. We quantified the total volume of COVID-19 articles, their sentiment polarization, and leading subtopics to act as a reference to inform future communication strategies. METHODS: We collected 26 million news articles from the front pages of 172 major online news sources in 11 countries (available online at SciRide). Using topic detection, we identified COVID-19-related content to quantify the proportion of total coverage the pandemic received in 2020. The sentiment analysis tool Vader was employed to stratify the emotional polarity of COVID-19 reporting. Further topic detection and sentiment analysis were performed on COVID-19 coverage to reveal the leading themes in pandemic reporting and their respective emotional polarizations. RESULTS: We found that COVID-19 coverage accounted for approximately 25.3% of all front-page online news articles between January and October 2020. Sentiment analysis of English-language sources revealed that overall COVID-19 coverage was not exclusively negatively polarized, suggesting widely heterogeneous reporting of the pandemic. Within this heterogeneous coverage, 16% of COVID-19 news articles (or 4% of all English-language articles) can be classified as highly negatively polarized, citing issues such as death, fear, or crisis. CONCLUSIONS: The goal of COVID-19 public health communication is to increase understanding of distancing rules and to maximize the impact of governmental policy. 
The extent to which the quantity and quality of information from different communication channels (eg, social media, government pages, and news) influence public understanding of public health measures remains to be established. Here we conclude that a quarter of all reporting in 2020 covered COVID-19, which is indicative of information overload. In this capacity, our data and analysis form a quantitative basis for informing health communication strategies along traditional news media channels to minimize the risks of COVID-19 while vaccination is rolled out.


Subject(s)
COVID-19/epidemiology , Data Mining/methods , Mass Media/statistics & numerical data , Public Health/methods , Social Media/statistics & numerical data , Health Resources , Humans , Pandemics , SARS-CoV-2/isolation & purification
14.
MAbs ; 13(1): 1892366, 2021.
Article in English | MEDLINE | ID: mdl-33722161

ABSTRACT

The patent literature should reflect the past 30 years of engineering efforts directed toward developing monoclonal antibody therapeutics. Such information is potentially valuable for rational antibody design. Patents, however, are designed not to convey scientific knowledge, but to provide legal protection. It is not obvious whether antibody information from patent documents, such as antibody sequences, is useful in conveying engineering know-how, rather than as a legal reference only. To assess the utility of patent data for therapeutic antibody engineering, we quantified the amount of antibody sequences in patents destined for medicinal purposes and how well they reflect the primary sequences of therapeutic antibodies in clinical use. We identified 16,526 patent families covering major jurisdictions (e.g., US Patent and Trademark Office (USPTO) and World Intellectual Property Organization) that contained antibody sequences. These families held 245,109 unique antibody chains (135,397 heavy chains and 109,712 light chains) that we compiled in our Patented Antibody Database (PAD, http://naturalantibody.com/pad). We find that antibodies make up a non-trivial proportion of all patent amino acid sequence depositions (e.g., 11% of USPTO Full Text database). Our analysis of the 16,526 families demonstrates that the volume of patent documents with antibody sequences is growing, with the majority of documents classified as containing antibodies for medicinal purposes. We further studied the 245,109 antibody chains from patent literature to reveal that they very well reflect the primary sequences of antibody therapeutics in clinical use. This suggests that the patent literature could serve as a reference for previous engineering efforts to improve rational antibody design.


Subject(s)
Antibodies, Monoclonal/chemistry , Data Mining , Databases, Protein , Immunoglobulin Heavy Chains/chemistry , Immunoglobulin Light Chains/chemistry , Intellectual Property , Legislation, Drug , Patents as Topic , Amino Acid Sequence , Antibodies, Monoclonal/therapeutic use , Drug Design , Immunoglobulin Heavy Chains/therapeutic use , Immunoglobulin Light Chains/therapeutic use
15.
Bioinformatics ; 36(6): 1750-1756, 2020 03 01.
Article in English | MEDLINE | ID: mdl-31693112

ABSTRACT

MOTIVATION: Over the last few years, the field of protein structure prediction has been transformed by increasingly accurate contact prediction software. These methods are based on the detection of coevolutionary relationships between residues from multiple sequence alignments (MSAs). However, despite speculation, there is little evidence of a link between contact prediction and the physico-chemical interactions which drive amino-acid coevolution. Furthermore, existing protocols predict only a fraction of all protein contacts and it is not clear why some contacts are favoured over others. Using a dataset of 863 protein domains, we assessed the physico-chemical interactions of contacts predicted by CCMpred, MetaPSICOV and DNCON2, as examples of direct coupling analysis, meta-prediction and deep learning. RESULTS: We considered correctly predicted contacts and compared their properties against the protein contacts that were not predicted. Predicted contacts tend to form more bonds than non-predicted contacts, which suggests these contacts may be more important than contacts that were not predicted. Comparing the contacts predicted by each method, we found that MetaPSICOV and DNCON2 favour accuracy, whereas CCMpred detects contacts with more bonds. This suggests that the push for higher accuracy may lead to a loss of physico-chemically important contacts. These results underscore the connection between protein physico-chemistry and the coevolutionary couplings that can be derived from MSAs. This relationship is likely to be relevant to protein structure prediction and functional analysis of protein structure and may be key to understanding their utility for different problems in structural biology. AVAILABILITY AND IMPLEMENTATION: We use publicly available databases. Our code is available for download at https://opig.stats.ox.ac.uk/. SUPPLEMENTARY INFORMATION: Supplementary information is available at Bioinformatics online.


Subject(s)
Computational Biology , Sequence Analysis, Protein , Algorithms , Protein Conformation , Proteins/genetics , Sequence Alignment , Software
16.
Brief Bioinform ; 21(5): 1549-1567, 2020 09 25.
Article in English | MEDLINE | ID: mdl-31626279

ABSTRACT

Antibodies are proteins that recognize the molecular surfaces of potentially noxious molecules to mount an adaptive immune response or, in the case of autoimmune diseases, molecules that are part of healthy cells and tissues. Due to their binding versatility, antibodies are currently the largest class of biotherapeutics, with five monoclonal antibodies ranked in the top 10 blockbuster drugs. Computational advances in protein modelling and design can have a tangible impact on antibody-based therapeutic development. Antibody-specific computational protocols currently benefit from an increasing volume of data provided by next generation sequencing and application to related drug modalities based on traditional antibodies, such as nanobodies. Here we present a structured overview of available databases, methods and emerging trends in computational antibody analysis and contextualize them towards the engineering of candidate antibody therapeutics.


Subject(s)
Antibodies, Monoclonal/chemistry , Antibodies, Monoclonal/immunology , Antibodies, Monoclonal/therapeutic use , Computational Biology/methods , Databases, Protein , Molecular Docking Simulation , Protein Conformation
17.
MAbs ; 11(7): 1197-1205, 2019 10.
Article in English | MEDLINE | ID: mdl-31216939

ABSTRACT

Recently it has become possible to query the great diversity of natural antibody repertoires using next-generation sequencing (NGS). These methods are capable of producing millions of sequences in a single experiment. Here we compare clinical-stage therapeutic antibodies to the ~1 billion sequences from 60 independent sequencing studies in the Observed Antibody Space database, which includes antibody sequences from NGS analysis of immunoglobulin gene repertoires. Of 242 post-Phase 1 antibodies, we found 16 with sequence identity matches of 95% or better for both heavy and light chains. There are also 54 perfect matches to therapeutic CDR-H3 regions in the NGS outputs, suggesting a nontrivial amount of convergence between naturally observed sequences and those developed artificially. This has potential implications for both the legal protection of commercial antibodies and the discovery of antibody therapeutics.
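
Illustrative sketch (not the authors' pipeline): the two comparisons reported above, a sequence identity threshold for variable chains and exact CDR-H3 matches, can be expressed in a few lines. The abstract does not say how sequences were aligned, so this example makes the simplifying assumption that only equal-length sequences are compared position by position; a real analysis would align sequences via an antibody numbering scheme. All function names and sequences are hypothetical.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of identical positions between two equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def best_match(query, repertoire):
    """Return (identity, sequence) for the closest equal-length repertoire
    sequence, or None if no sequence of the same length exists."""
    candidates = [
        (percent_identity(query, seq), seq)
        for seq in repertoire
        if len(seq) == len(query)
    ]
    return max(candidates) if candidates else None


def perfect_cdrh3_matches(therapeutic_cdrh3s, repertoire_cdrh3s):
    """CDR-H3 sequences found verbatim in both collections."""
    return sorted(set(therapeutic_cdrh3s) & set(repertoire_cdrh3s))


if __name__ == "__main__":
    # Invented 20-residue fragments, for demonstration only.
    query = "EVQLVESGGGLVQPGGSLRL"
    repertoire = ["EVQLVESGGGLVQPGGSLRV", "QVQLQESGPGLVKPSETLSL", "EVQLV"]
    print(best_match(query, repertoire))
    print(perfect_cdrh3_matches(["ARDYW", "ARGGY"], ["ARGGY", "AKDLW"]))
```

At the scale described in the abstract, a linear scan like `best_match` would be far too slow; it is shown only to make the matching criterion concrete.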


Subject(s)
Complementarity Determining Regions/genetics , Immunoglobulins/genetics , Immunotherapy/methods , Data Mining , Databases, Genetic , High-Throughput Nucleotide Sequencing , Humans , Immunity, Humoral , Immunoglobulins/therapeutic use
18.
Proc Natl Acad Sci U S A ; 116(10): 4025-4030, 2019 03 05.
Article in English | MEDLINE | ID: mdl-30765520

ABSTRACT

Therapeutic mAbs must not only bind to their target but must also be free from "developability issues" such as poor stability or high levels of aggregation. While small-molecule drug discovery benefits from Lipinski's rule of five to guide the selection of molecules with appropriate biophysical properties, there is currently no in silico analog for antibody design. Here, we model the variable domain structures of a large set of post-phase-I clinical-stage antibody therapeutics (CSTs) and calculate in silico metrics to estimate their typical properties. In each case, we contextualize the CST distribution against a snapshot of the human antibody gene repertoire. We describe guideline values for five metrics thought to be implicated in poor developability: the total length of the complementarity-determining regions (CDRs), the extent and magnitude of surface hydrophobicity, positive charge and negative charge in the CDRs, and asymmetry in the net heavy- and light-chain surface charges. The guideline cutoffs for each property were derived from the values seen in CSTs, and a flagging system is proposed to identify nonconforming candidates. On two mAb drug discovery sets, we were able to selectively highlight sequences with developability issues. We make available the Therapeutic Antibody Profiler (TAP), a computational tool that builds downloadable homology models of variable domain sequences, tests them against our five developability guidelines, and reports potential sequence liabilities and canonical forms. TAP is freely available at opig.stats.ox.ac.uk/webapps/sabdab-sabpred/TAP.php.


Subject(s)
Complementarity Determining Regions , Computer Simulation , Models, Molecular , Antibodies, Monoclonal/chemistry , Antibodies, Monoclonal/genetics , Complementarity Determining Regions/chemistry , Complementarity Determining Regions/genetics , Drug Discovery , Humans
19.
J Immunol ; 201(12): 3694-3704, 2018 12 15.
Article in English | MEDLINE | ID: mdl-30397033

ABSTRACT

Next-generation sequencing of the Ig gene repertoire (Ig-seq) produces large volumes of information at the nucleotide sequence level. Such data have improved our understanding of immune systems across numerous species and have already been successfully applied in vaccine development and drug discovery. However, the high-throughput nature of Ig-seq means that it is afflicted by high error rates. This has led to the development of error-correction approaches. Computational error-correction methods use sequence information alone, primarily designating sequences as likely to be correct if they are observed frequently. In this work, we describe an orthogonal method for filtering Ig-seq data, which considers the structural viability of each sequence. A typical natural Ab structure requires the presence of a disulfide bridge within each of its variable chains to maintain the fold. Our Ab Sequence Selector (ABOSS) uses the presence/absence of this bridge as a way of both identifying structurally viable sequences and estimating the sequencing error rate. On simulated Ig-seq datasets, ABOSS is able to identify more than 99% of structurally viable sequences. Applying our method to six independent Ig-seq datasets (one mouse and five human), we show that our error calculations are in line with previous experimental and computational error estimates. We also show how ABOSS is able to identify structurally impossible sequences missed by other error-correction methods.
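
Illustrative sketch (not the ABOSS code): the structural filter described above can be pictured as a check for the two cysteines that form the intra-domain disulfide bridge. The example assumes sequences have already been numbered in the IMGT scheme, where the conserved cysteines sit at positions 23 and 104, and that each sequence is supplied as a position-to-residue dictionary. The returned fraction of sequences lacking the bridge is only a crude proxy; the abstract does not describe how ABOSS converts this signal into a sequencing error rate.

```python
# IMGT positions of the conserved cysteines in an antibody variable domain.
CONSERVED_CYS_POSITIONS = (23, 104)


def has_conserved_cysteines(numbered) -> bool:
    """True if both conserved positions carry a cysteine.

    `numbered` maps IMGT position (int) to a one-letter residue code;
    this input format is an assumption made for the example.
    """
    return all(numbered.get(pos) == "C" for pos in CONSERVED_CYS_POSITIONS)


def filter_viable(numbered_sequences):
    """Split off structurally viable sequences and report the fraction of
    input sequences that lack the disulfide-forming cysteines."""
    viable = [seq for seq in numbered_sequences if has_conserved_cysteines(seq)]
    total = len(numbered_sequences)
    missing_fraction = 1 - len(viable) / total if total else 0.0
    return viable, missing_fraction


if __name__ == "__main__":
    # Toy records listing only the two positions of interest.
    reads = [
        {23: "C", 104: "C"},
        {23: "C", 104: "C"},
        {23: "C", 104: "S"},  # second cysteine replaced
        {23: "C", 104: "C"},
    ]
    viable, missing = filter_viable(reads)
    print(len(viable), missing)
```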


Subject(s)
High-Throughput Nucleotide Sequencing/methods , Immunoglobulins/genetics , Software , Vaccines/immunology , Algorithms , Animals , Computational Biology , Databases as Topic , Drug Development , Humans , Mice , Protein Conformation , Quality Control , Scientific Experimental Error , Structure-Activity Relationship
20.
J Immunol ; 201(8): 2502-2509, 2018 10 15.
Article in English | MEDLINE | ID: mdl-30217829

ABSTRACT

Abs are immune system proteins that recognize noxious molecules for elimination. Their sequence diversity and binding versatility have made Abs the primary class of biopharmaceuticals. Recently, it has become possible to query their immense natural diversity using next-generation sequencing of Ig gene repertoires (Ig-seq). However, Ig-seq outputs are currently fragmented across repositories and tend to be presented as raw nucleotide reads, which means nontrivial effort is required to reuse the data for analysis. To address this issue, we have collected Ig-seq outputs from 55 studies, covering more than half a billion Ab sequences across diverse immune states, organisms (primarily human and mouse), and individuals. We have sorted, cleaned, annotated, translated, and numbered these sequences and make the data available via our Observed Antibody Space (OAS) resource at http://antibodymap.org. The data within OAS will be regularly updated with newly released Ig-seq datasets. We believe OAS will facilitate data mining of immune repertoires for improved understanding of the immune system and development of better biotherapeutics.


Subject(s)
Antibodies/genetics , Data Mining/methods , Immunoglobulins/genetics , Immunotherapy/methods , Animals , Antibody Diversity , Databases, Genetic , High-Throughput Nucleotide Sequencing , Humans , Immunity, Humoral/genetics , Mice , Molecular Sequence Annotation